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Abstract 

In computational biology, gene expression datasets are characterized by very few individ- 
ual samples compared to a large number of measurements per sample. Thus, it is appealing 
to merge these datasets in order to increase the number of observations and diversify the 
data, allowing a more reliable selection of genes relevant to the biological problem. Besides, 
the increased size of a merged dataset facilitates its re-splitting into training and validation 
sets. This necessitates the introduction of the dataset as a random effect. In this context, 



extending a work of lLee et al.l |2003i] , a method is proposed to select relevant variables among 
tens of thousands in a probit mixed regression model, considered as part of a larger hierar- 
chical Bayesian model. Latent variables are u s ed to identify subsets of selected variables and 



the grouping (or blocking) technique of LirJ 1994| is combined with a Metropolis-within 



Gibbs algorithm Robert and Casella . 2004| . The method is applied to a merged dataset 



made of three individual gene expression datasets, in which tens of thousands of measure- 
ments are available for each of several hundred human breast cancer samples. Even for this 
large dataset comprised of around 20000 predictors, the method is shown to be efficient and 
feasible. As an illustration, it is used to select the most important genes that characterize 
the estrogen receptor status of patients with breast cancer. 

Keywords: Bayesian variable selection, random effects, probit mixed regression model, grouping 
technique (or blocking technique). Metropolis- within-Gibbs algorithm. 



1 Introduction 

Selection of variables is a common problem in many scientific fields, and particularly in bioinfor- 
matics. Gene expression profiling analyses are notorious for generating a very large number of 
predictors compared to the number of observations. Microarray or high throughput sequencing 
technologies are important for finding genes that are implicated in biological processes including 
development, disease, and response to treatment, and it plays an important role in the current 
tendency towards personalized medicine. Identified genes or sequences can be used to classify 
future observations, influencing the treatment of patients. However, these experiments are ex- 
pensive, and datasets have often no more than 100 specimens. The goal, therefore, is to advance 
a method allowing variable selection from merged microarray datasets, each of them presenting 
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its own individual experimental bias. 



Several model-based approaches have been developed to select variables. A well-known 
example is SVM (S u pport Vector Machine) with a rec ursive feature elimina tion of the genes 
(jGuvon et al.l [2OO4]). iGeorge and McCullochI |l993l | and lChipman et al.l [200 1[ | developed Bayesian 



variable selection wit h the use of Gibbs sampling fo r linear models ; a rev iew of this type of se- 
lection is provided by O'Hara and Sillanpaa.1 2009l |. Tadesse et al. 2005 ] proposed a Bayesian 



variable selectio n in a model-based clustering a pproach, using a multivariate Gaussian mixture 
model. Recently Bottolo and Richardson 201[lt | proposed an algorithm based upon Evolutionary 
Monte Carlo. Binary responses are often encountered in biostatistics studies, therefore probit 
or logistic mod el s are imp l ied. B ayesian var i able se l ection meth o ds hav e be en proposed by 
Lee et al.l i2003l |. ISha et al.l l2004l]. Izhou et all l2004bll Izhou et al.l fiooi^l and lYang and Song 
'Mod for probit regression, and bv lZhou et alT [2004a] . Ichen and D ev [20oi] and lXiichleil |2008l | 



for logistic regression. Extension to multi-category data has been done for the probit model in 
Albert and ChibI |l993l |. 



The motivation behind the variable selection method developed in this paper is to take the 
design of the study into account by using random effects in a mixed model. It is particularly 
suited to a merged micro array dataset desi gn, and many such datasets are freely available from 
the NCBI GEO website [Edga r et al. l. l2002l |. The increased size of a merged dataset may provide 
improved power, and facilitates its re-splitting into training and validation sets. In addition a 
merged set comprises more data diversity than an individu al set, hence we can avoid bias due 
to a particular dataset as explained by various authors, see ICheng et al.l |2O10l | and references 
therei n. Among all the methods previously proposed for variable selection, that of iTiichler 
20081 ] considered mixed models. However, her approach was specific for logistic models, and 



the method was applied to datasets with only few dozens predictors, whereas the aim of this 
paper is to select a few predictors among tens of thousands in a Bayesian framework. Recently 
Friihwirth-Schnatter and Wagner 20ld ] considered variable selection for random effects, but in 
this paper we are more interested by variable selection for the fixed effects, assuming that ran- 
dom effects are present. 



The approach de v eloped in this paper extends th e approach of lGeorge and McCullochI jl993l ] 
and Lee et al. 20031 ]. George and McCulloch 1993) introduced latent variables to identify sub- 
sets of selected variables in a linear model. Then Lee et al. 20031 ] used these latent variables in a 
probit regression model, which is c onsidered as part of a larger hierarchical Bayesian model. Our 
method extends the model used bv lLee et al.l 20031 ] by adding random effects. We are then con- 
fronted with several difficulties. One concerns the simulation of conditional distributions, since 
full conditional distributio ns cannot be directly simulated. A solution is to use the grouping 
(or blocking) technique of iLiul 19941 ]. and to combine Gibbs sampler and Metropolis-Hastings 
algorithms. Therefore the algorithm develop ed is a combination of the grouping method of Liu 
and the Metropolis- within-Gibbs algorithm [Robert and Casellal . 120041 ]. A computational diffi- 
culty due to the large number of genes had also been overcome by imposing a fixed number of 
selected genes at each iteration of the algorithm. As a consequence the influence of the value 
chosen for the variable selection coefficient of our model is reduced. That represents an advan- 
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tage, since the value of this co efficient can impact the results of other methods, see for instance 
Bottolo and Richardson 20101 ] who proposed to put a hyperprior distribution on this coefficient. 



In this paper, Affymetrix microarray data are used, so predictors (genes) will be referred to 
as "probesets", according to that technology. An Affymetrix U133plus2 microarray proffies all 
of the genes in the human genome, many of them more than once, using over 54000 gene-specific 
"probesets". Our Bayesian variable selection method for probit mixed models is developed to 
select a few important probesets, among tens of thousands, which are indicative of the activity 
of the estrogen receptor gene in breast cancer. The severity of this common and deadly disease is 
directly related to estrogen receptor (ER) status, which is traditionally measured biochemically. 
Three different breast cancer datasets were used, all with clinically defined ER status. One 
microarray experiment was done per patient, and ten of thousands of probesets were measured 
per experiment. The dataset is introduced as a random effect in the model, thus accounting 
for the different experimental conditions implicit in each set. The three merged datasets were 
split into training and validation sets, and the relevance of the selected probesets was checked 
by fitting a probit mixed model on the training set and predicting the ER status for the patients 
from the validation set and other independent sets available from the NCBI GEO website. The 
stability and the sensit ivity of the algorithm were a lso checked by using the relative weighted 
consistency measure of Somol and Novovicova 2008l |. 



The remainder of the paper is organized as follows. Section 2 describes the probit mixed 
model with latent variables. Section 3 gives the full conditional distributions necessary for the 
Gibbs sampling algorithm, outlines the algorithm and proposes a way to construct a classification 
rule using the selected probesets. Section 4 provides some experimental results on real datasets, 
on the relevance of selected probesets, and on the sensitivity and the stability of the method. 
Finally Section 5 discusses the method. 



2 Probit mixed model for gene selection 
2.1 The hierarchical model 

Suppose that n binary events are observed, denoted by the 1^, z = 1, . . . , n. The set of potential 
regressors is of size p, with p ^ n. The goal is to select a subset of regressors related to the 
events li, . . . ,1^. The following probit mixed model is considered, 

P{Yi = l\U,P)=pi = ^Xl(3 + ZlU), 

where $ stands for the standard Gaussian cumulative distribution function, and Xi and Zi for 
the fixed and random effect regressors associated with the i*'^ observation. The parameter /3 
corresponds to the fixed-effect coefficients and the parameter U to the random-effect coefficients. 
X and Z are design matrices associated with the fixed and random effects. 
Assuming that we have K random effects, U = {U[, . . . , U'j^)' . Each Ui is of size qi, and J2iLi Qi — 
q. The size of j3 is p. 



Fohowing lAlbert and ChibI |l993l ] and iLee et all |2003l ]. a vector of latent variables L is 
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introduced. We write L = {Li, . . . , Ln) and we assume that L \ U, (3 ^ Mn^Xf] + ZU, In) with 
In the identity matrix. We then have 



1 if > 
if L, < 0, 



To perform variable selection, a vector 7 of p indicator variables is introduced: 

7j = 



1 if /3j ^ 0, variable j selected 
if /3j = 0, variable j not selected. 



Given 7, f3^ is the vector of all nonzero elements of /3, and is the matrix X with only the 
columns corresponding to the elements of 7 that are equal to 1. 



2.2 Prior distributions 

To complete the hierarchical model, some prior assumptions have to be made on [/ | D, | 7, 
7 and D, where D is a covariance matrix of dimension q. 

• If the data supports 7^ = over = 1, then the j*'^ variable will not be needed in the 
model and we can let l3i = 0. We then focus on the prior distribution of the non null 
vector Like Lee et al. 2003l |. we take the following conventional prior: 



p^\jr^Md{o,c{x'x^r^) 



with 



(2) 



This prior correspond s to the g-prior of IZellneii 198611 . and c is a positive scale factor 
specified by the user. Bottolo and Richardson 201Cll | called it th e variable select i on co - 
efficient. Several author s discussed the cho i ce of its v alue, see IChipman et al.l [20011 ]. 



George and Fosteii I2OOOII . IClvde and Georgd j2000l ] and ISmith and KohnI [1997t | among 



others. iRafterv et al.l 1997l | used a similar form of prior. In our algorithm the value of c 
will be fixed, but will not be too influent (see the discussion). 

The 7j are assumed to be independent Bernoulli variables, with 



PilJ = 1) = ^i, 



< VTj < 1. 



We do not want to use prior knowledge to favor any probesets, so we put ttj 
Vi = l,...p. 



The vector of coefficients associated with the random effects is assumed to be Gaussian 
and centered: 

U 1 D^f/q{0,D). 



This definition allows three cases to be distinguished: 
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General case: No structure is assumed for the variance-covariance matrix D, its prior 
distribution is an Inverse-Wishart W~^{'^,m). 

Case of a block- diagonal matrix D: The different random effects are assumed indepen- 
dent. The vectors of coefficients associated with each random effect have Gaussian prior 
distributions: 

Ui\Ai^Uq,{(},Ai), 1 = 1,. ..,K, 

where the Ai are symmetric design matrices of dimension qi. D is a block-diagonal matrix 
denoted by diag{Ai, . . . , Ak)- The prior distributions for each Ai are Inverse-Wishart 

Case of a diagonal matrix D: D = diag{Ai, . . . , Ak) where Ai = (jflqi, I = 1, . . . , K and 
Iqj the identity matrix. The prior distributions for the af are then Inverse Gamma (a, b) 
{b denotes the scale). 

3 Bayesian sampler for variable selection 
3.1 The conditional distributions 

The posterior distribution of 7 is of particular interest since it encapsulates the effectiveness of 
the different explanatory variables in explaining the variation in the responses Y. The number 
of possible explanatory variables is on the order of tens of thousands, so the number of possible 
7-vectors is extremely large. The idea is to use a Gibbs sampling algorithm to explore this 
posterior distribution and search for high probability 7 values. 

In order to use the classical Gibbs sampler, we must be able to simulate from all of the full 
conditional distributions (simplified by the hierarchical structure): f{L \ Y, (3, U), f{f3 \ L, U, 7), 
f{U I L, P, D), /(7 I L, U, P) and f{D \ U). 

• Full conditional distribution of L. 

Li\l3,U,Yi = l ~ Af{Xll3 + Z'^U, 1) left truncated at 

Li\l3,U,Yi = ~ ^^{X'i/3 + Z^U, 1) right truncated at 0. (3) 

• Full conditional distribution of /3. 

Given 7, we know which elements of P are not null. So we focus on the generation of the 
non null elements of jS-y. Letting Vj = j^{Xi^X-y)~'^ , we have 

p 

P-y\L,U,-f ~ AfdiV^X'^{L-ZU),V^) with d = Y,7i- (4) 

1=1 

• Full conditional distribution of U. 
Defining W = {Z'Z + £'"^)"\ we have 

U \ L,P,D r^^fg{WZ'{L- X/3),W). (5) 
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Full conditional distributions of 7. 



f{j\f3^,L,U) oc (27r) 2 exp 



77 



X C 



(^;^7r'r^n^?(i-^^-)'"'^- 



(6) 



Full conditional distribution of D. 

General case: The full conditional distribution of D is an Inverse-Wishart: 



D\U ^ W^{UU' + -^,m + l). 



(7) 



Case of a block- diagonal matrix D: D = diag{Ai, , 
tion of Ai (yi = 1, . . . , K) is an Inverse-Wishart: 



,Ak)- The full conditional distribu- 



Ai\Ui ~ W-\UiU[ + ^,m + l). 



(8) 



Case of a diagonal matrix D: D = diag{Ai, . . . , Ax), and V/ = 1, 
full conditional distribution of af is an Inverse-Gamma: 



,K, Ai = aflg,. 



Ui ~ ig(^^ + a,{^U[Ui + b 



The 



(9) 



3.2 Use of the grouping technique 

The classical Gibbs sampler cannot be used because the full conditional distribution of 7 cannot 
be directly simulated (see (H])). However, this full conditional distribution can be simulated with 
a Metropolis-Has tings algorithm, and the coni plete algorithm would be a Metropolis-within- 
Gibbs algorithm. Roberts and Rosenthal 2006l | have shown the Harris-recurrence of this algo- 
rithm, therefore its convergence is guaranteed. But even with a Metropolis-Hastings algorithm, 
the full conditional distribution of 7 is difficult to obtain, since it depends on the actual value 
of I3y. Thus the acceptance rate for a candidate 7* in the Metropolis-Hastings algorithm will 
depend both on the actual 7^*^ and I3^(t) , and on the proposed 7* and /3^* . The problem is that 
(3y* is unknown. 

To get around this problem, we combinethe Metropolis- within- Gibbs algorithm with the group- 
ing (or blocking) technique of iLiul 1994i |. The idea is to group the parameters and 7, so 
we will be interested in the full conditional distribution of (f3-y,j) \ L,U. Thi s techniqu e im- 
proves the algorithm an d facilitates the convergence of the Markov chain, see iLiul 1994i | and 
van Dyk and Park! 2008f |. We note that the sampler ob tained is then a special case of a Partial 
Collapsed Gibbs Sampler, see van Dyk and Park 20081 ] . 
As we have 

/(/3^,7 I L,^) oc /(7 I L,U)f{Py I -/,L,U), 

we remark that simulating from the full conditional distribution (/3^,7) \ L,U is equivalent to 
simulating 7 from its full conditional distribution integrated on /3^, then simulating from its 
full conditional distribution. The "integrated distribution" for 7 will not depend anymore on 
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the nuisance parameter (3^ and will be easily simulated by a Metropolis-Hastings algorithm. 
In each iteration of the algorithm, we will take care to simulate 7 before /37, to keep the depen- 
dence between /3j and 7, as noted by Ivan Dvk and ParkI |2008l |. 



We use /(L | 7, U) and the Bayes Theorem to get the integrated distribution of 7 | L, C/ (the 
target distribution): 



/(7|L,C/) oc (1 + c) 



E7. 



exp 



\{{L - ZUYiL - ZU) 



(10) 



1 + c 



{L - ZUyX^{X'^X^y'X'^{L - Z[/)}] X n^?(l - ^i)'" 



'7j 



3.3 The Metropolis-within-Gibbs sampler modified by the grouping tech- 
nique 

3.3.1 A Metropolis-Hastings step to simulate 7 

At iteration {i + 1) of the Metropolis-Hastings algorithm, a candidate 7* will be proposed from 
^(«). "We want a symmetric transition kernel, to simplify the acceptance rate of the algorithm. 
The simplest way to have a symmetric transition kernel is to prop ose a 7* which corresp onds 
to 7^^^ in which r coinponen ts have been randomly changed (see IChipman et al.l 200ll | and 



George and McCullochI [1997l |). 

Given the target distribution PU]) . the acceptance rate p is then: 

p(7«,7*) 




^Y^—^{L-ZUy(^Xj*{X'^,X^*) ^X'^, - X^(,){X'^(^i■,X^(^)) 

7T xE?(7;-7«) 1 



1 -TT 



(11) 



To facilitate the computation of the algorithm, the proposed 7* still corresponds to 7*^*-* for 
which r components have been changed, but in such a way that the number of components 
whose values are 1 (and so the number of selected variables) is invariant. In so doing, r/2 
components among the 1 values, and r/2 components among the values are chosen at random 
and switched. There are several advantages to propose such a 7*: 

• In an iteration of the algorithm, if we have the number of variables selected d higher 
than the number of observations n, the X'^X^ matrix would be singular, and the prior 
distribution of could not be defined as in ([2]). An advantage of fixing the number of 
variables to be selected at each iteration is that this number cannot increase during a run 
of the algorithm, and if d is chosen lower than n this case of non singularity of X!yX^ is 
avoided. 



The acceptance rate is simplified, as we obtain ^ (7!*^ — 7*) = 0. 
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The choice of the prior value for the variable selection coefficient c used in the prior 
distribution of (3 is less influent (see the discussion). 



Remark. In the method of lLee et al.l 2003l | , the 7 vector is generated component by compo- 
nent at each iteration, while in our method a Metropolis-Hastings algorithm is used to generate 
it. There are two advantages to use a Metropolis-Hastings algorithm: it is computationally 
advantageous for a very large number of variables compared to a generation component by com- 
ponent, and it enables us to easily generate a 7 vector with an invariant number of components 
whose values are 1. 



3.3.2 The sampler 

The Metropolis-within-Gibbs sampler modified by the grouping technique of Liu generates a 
sequence: 

7^^) , /3« ,D^'\ , C/(i) , , 7(^+-) , /3(''+-) , , , U^'+"'^ . 

The sequence of the 7'-*^ which is of interest for the variable selection problem, is embedded 
in this "Gibbs sequence". To generate it, at each iteration 7 is simulated from its integrated 
distribution and /3^,L,U and D are simulated from their full conditional distributions. 

Algorithm: 

Starting with initial values 7W, L>(o), L(°), [/(o). At iteration t + 1: 

1. Simulate 7(*+i) from f{j \ LW,[/W) ( see [TOj) . using the Metropolis-Hasting 
(MH) step. The MH step begins with 7^*^ as an initial value, and k iterations 
are performed given L^^^ and U^^^ {k arbitrarily fixed). At iteration i + 1 of the 
MH step: 

(a) Generate the 7* candidate, by randomly switching r/2 components among 
the 1 values, and r/2 components among the values. 

(b) Take 

{«+!)_/ ^* with probability pil^'Kl*) see pT]) 
\ 7^*) with probability 1 — p{'y'^^\^*) 

jiW) -win be the 7'-*^'' obtained at the k^'^ iteration of the MH step. 

2. Simulate from /(/3^ | L^, C/W, 7(^+1)) (see ©). 

3. Simulate from f{D \ f/W) (see ©, © or 

4. Simulate L(*+i) from /(L | y, C/W) (see ©). 

5. from f{U \ d(*+i)) (see 

We use the fact that X/3 = X^(3^ and that (3 can be obtained from 7 and /3^. The number of 
iterations is 6 + m, where h corresponds to the burn-in period and m to the observations from 
the posterior distributions. 
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In our application, we are not concerned by the strict convergence of the sampler. The 
aim is to find some relevant variables explaining the response, and to obtain good predictions. 
Hence we only need to do stability and sensitivity studies to check that the training set and the 
choices of the hyperparameters are not too influent, and we check the biological relevance of the 
variables selected. 



3.3.3 The selected probesets 

For selection of variables, the sequence {7W = {jP,.. . ,7^*^),t = 6+1,. . . ,6 + m} is used. The 
most relevant variables for the regression model are those which are supported by the data and 
prior information. Thus they are those corresponding to the 7 components with higher posterior 
probabilities, and can be identified as the 7 components that are most often equal to 1. 



3.4 Classification and prediction 

Once a set of relevant variables have been selected, it can be used to fit a probit mixed model 
in a classical way and to classify future observations. However, if more variables than necessary 
to fit a probit mixed model have been selected in the Bayesian selection step, a second selection 
has to be performed on them in order to build a reliable probit mixed model. This second 
selection is performed on the training set using standard selection tools like AIC, BIG, Bayes 
factors,. . . . The final probit mixed model can be tested on the validation set. Moreover, the 
variables selected in the Bayesian selection step can be used in other classification methods, such 
as Support Vector Machines (but random effects are not taken into account). 



4 Experimental results 



4.1 Application to the ER status of patients with breast cancer 
4.1.1 Description of the datasets 

Three different datasets were used: one private dataset from the Institut Paoli Calmettes (Mar- 
seille, France), consisting of 1 51 sa mples, and two datasets freely available from the NCBI GEO 



public website [Edgar et al.l . |2002| |: accession numbers GSE2109 (310 samples) and GSE5460 



(124 samples). Each dataset was treated for b ackground noise and normalized with respect to a 
reference distribution by the RMA procedure Irizarrv et al. . 2003l |. Each dataset was split into 
a training set and a validation set having the same proportions of ER positive and ER negative 
observations. Then the three training datasets were merged on one side (497 patients) and the 
three validation sets were merged on the other side (88 patients). 

For each patient, more than 54000 probesets were available. Two filters were applied on all of 
these probesets. Only the probesets sufficiently expressed so that they can be differentiated from 
noise and which could not be considered as invariant were kept, resulting in 19384 probesets. 
The goal was to select only a few probesets which are related to the ER status of the patients, by 
taking into account the different experimental conditions between the different merged datasets. 
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In this illustration, there are thousands of fixed regressors corresponding to the expression 
measurements of probesets, and only one random effect, which corresponds to the different 
datasets. Xij corresponds to the measurement of the expression level of the j^^ probeset for the 
i*^ patient, and Zn = 1 if the i^^ patient is from the l^^ dataset, otherwise. 



4.1.2 Prior settings for the algorithm 



Following the recommendations of ISmith and KohnI |l997l |. a value of c = 50 was chosen 



for the variable selection coefficient used in the prior distribution of /3. 

• Thirty probesets were selected at each iteration of the Gibbs sampler, when 7 is generated; 
r = 10 of them were changed at each iteration of the Metropolis-Hastings step (5 zeros 
and 5 ones). 

• The random effect corresponds to the dataset, and the three datasets are considered in- 
dependent: they were generated in different countries, by different teams, using different 
equipments and different patients. Therefore the variance- covariance mat rix of the ran- 
dom effect D was a diagonal matrix 3x3 with Ai = alh. iGelmanI [200d | noted that an 



inverse-gamma prior should not be too non-informative, otherwise serious problems can 
arise. Given our data, we knew that a\ is probably not too high, and a XQ{2,J>) seemed 
reasonable for the prior distribution of a^. 

For the Metropolis-within-Gibbs sampler modified by the grouping technique, 60000 iter- 
ations were computed, among which 30000 were burn-in iterations. For the Metropolis- 
Hastings step in this sampler, 500 iterations were computed, and the simulated 7 was the 
one corresponding to the 500*^^ iteration. 



4.1.3 Results and predictions 

We performed a first selection of variables by selecting the top-rank probesets, those which have 
been selected the most often. A boxplot can help, see Figured) Forty probesets were selected at 
least once from the 30000 post-burn- in iterations of the simulated Markov Chain for 7. Twenty 
three probesets were selected in the 30000 iterations, and thirty were selected at least from 20000 
iterations. There is a gap between probesets selected in more than 20000 iterations and others, 
so the first selection is made of these probesets selected at least in 20000 iterations. 
A second selection was performed on the thirty probesets from the first selection, to build a 
reliable probit mixed model. This second selection was performed on the training set, using a 
stepwise selection (with AlC and BIG criteria) and the classification performance of the model 
on the validation set. Five probesets were kept: Affymetrix symbols 228241_at, 205862_at, 
202376_at, 216222_s_at and 1568760_at. See Table [T] for the associated gene symbols and 
coefficients. The estimated random effects of this final model were reasonable: —0.284 for the 
dataset from the Institut Paoli Galmettes, 0.199 for the GSE2109 dataset and 0.087 for the 
GSE5460 dataset. 
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Figure 1: Boxplot of the number of selections of a probeset after the burn-in period, for the real 
datasets example. Forty probesets were selected at least once, all of the other probesets were 
never selected. A point represents a probeset (or several probesets if they have been selected 
the same number of times) . 



Probeset 


gene 


Coefficient 


Pvalue 


Intercept 




-9.12074 


1.92e-05 


22824 l_at 


AGR3 


0.45046 


1.12e-15 


205862_at 


GREBl 


0.77639 


4.18e-08 


202376_at 


SERPINA3 


0.37965 


0.000149 


216222_s_at 


MYOlO 


-0.63551 


0.004967 


1568760_at 


MYHll 


0.42742 


0.050219 



Table 1: Probesets selected in the final model and associated coefficients. 



Using this 5-probeset model, two methods were used to predict the ER status of the patients 
in the validation set: 

1. Using knowledge of the dataset to which each patient belonged and using the estimated 
random effects coefficients. 

2. The estimated random effects coefficients are not used in order to mimic a real-life scenario 
of an experiment for a patient coming from an unknown dataset. 

The patients were predicted positive if their probability to be positive was higher than 0.5 and 
negative if it was lower than 0.5. The two methods gave us the same predictions, which were 
very good: a specificity of 1 and a sensitivity of 0.98 (1 wrong predictions among 88), see Figure[2j 
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Probability to be ER positive for the validation set 



■ True ER positive 
SI True ER negative 



02 04 6 0.6 

Probabiiity to be ER positive 



Figure 2: Histogram of probabilities to be ER positive given by the final model, for patients 
from the validation set. 



Remark. In biomedical studies, when continuous variables are often reclassified as binary, it 
is common to define an "undetermined zone" of probabilities for which no prediction are given. 
Indeed, it is sometimes better than giving a wrong prediction, because these predictions imply 
treatments. Defining an "undetermined zone" between 10% and 90% probability of being posi- 
tive, false predictions were eliminated, and 10 were considered undetermined (11.4%) (estimated 
random effects coefficients not used). 

As a final test of our model, two more independent datasets were brought in from the NCBI 
GEO website: the GEO series GSE6532 and the GEO series GSE12763. The random effects 
associated with these datasets were entirely unknown, simulating an even more realistic case 
of prediction for a patient coming from an unknown dataset. Once again the results were very 
good : only 1 wrong prediction among 29 for the GSE12763 dataset, and no wrong predictions 
among 86 for the GSE6532 dataset. 



4.2 Sensitivity and stability studies 



The sensitivity and th e stability of the algorithm w ere assessed by using the relative weighted 
consistency measure of lSomol and Novovicoval [20081], denoted by CWrei- It is a measure evalu- 
ating how much subsets of selected variables for several runs overlap, and it shows the relative 
amount of randomness inherent in the concrete variable selection process. It takes values be- 
tween and 1, where represents the outcome of completely random occurrence of variables in 
the selected subsets and 1 indicates the most stable variable selection outcome possible. 
Stability is defined as sensitivity to variations in the training set. Referring to our breast cancer 
data set, 4000 probesets were randomly chosen from among the 19384 originally available. Since 



12 



the aim here was only to check the sensitivity and stabihty of the method, these 4000 were not 
chosen in relation to the ER status. 

Several runs of the algorithm were performed, and are reported in Table [2l Concerning the 
stability, the algorithm was run on three different training sets of 497 microarrays (among 585), 
using the same prior values for the hyperparameters. Concerning the sensitivity, the algorithm 
was run on the same training set with different values of c, different prior distributions for af, 
different numbers of probesets to be selected at each iteration of the algorithm and different 
numbers of iterations. For the prior distributions for af, we chose a ZQ{2,3) which seemed 
reasonable given our data, a XQ{2, 5) to have a prior favoring higher values compared to the first 
one, a IQ{3, 1) to favor lower values, and a IQ(1, 1) to h ave a non-informative prior without too 
small parameters to avoid problems, see ICelmanl [200d |. 



For each run, a reasonable number of probesets could be easily selected. Indeed, two to ten 
probesets were selected much more often than the others, see Figure [3] (two to four probesets 
were selected for most of the runs). Hence there is no need to perform a second selection, as 
in section 14.1.31 To compare the results of the different runs, the relative weighted consistency 
measure of Somol and Novovicova CW^ei was used. 



Simu 


Dataset 


Value 
of 

c 


Prior 
for 


Nb probesets to be 

selected at each 
iteration of the GS 


Nb probesets to be 

changed at each 
iteration of the MH 


Iterations 
for the 
algo 


burn-in 
for the 
algo 


CWrel 


1 
2 
3 


1 
2 
3 


50 


/G(2,3) 


15 


6 


12000 


6000 


0.25 


4 
5 
6 
7 


1 


10 

50 
100 
1000 


/G(2,3) 


15 


6 


12000 


6000 


0.375 


8 
9 

10 
11 


1 


50 


/G(2,3) 
IG{l,l) 
/G'(l,l) 
1^(2,5) 


15 


6 


12000 


6000 


0.292 


12 
13 
14 


1 


50 


/G(2,3) 


15 
5 

30 


6 
2 

10 


12000 


6000 


0.5 


15 


1 


50 


/G(2,3) 


15 


6 


30000 


15000 





Table 2: Parameters of the runs for the stability and sensitivity study and associated relative 
weighted consistency measure of Somol and Novovicova CWrei- For the Metropolis-Hastings 
step, always 500 iterations are computed. 
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Figure 3: Boxplot of the number of selections of a probeset after the burn-in period, for two 
runs of the sensitivity analysis. A point represents a probeset (or several probesets if they 
have been selected the same number of times). The left boxplot corresponds to the run with 
c = 1000: there is a gap between the two probesets selected in more than 4000 iterations and the 
others, hence we selected these two probesets. The right boxplot corresponds to the run with 
af ~ IQ{2,5): there is a gap between the four probesets selected in more than 1500 iterations 
and the others, hence we selected these four probesets. 

Using the results of the 15 runs together, CWrei = 0.398. Subsets of selected variables 
for the different runs overlapped: among the 15 runs the probesets 215552_s_at, 209603_at 
and 209602_s_at were kept in 12, 12 and 6 runs respectively. Apparently the prior for cj^ 
{CWrei = 0.292) has more impact than the number of probesets to be selected at each iteration 
{CWrei = 0.5). We note that the number of probesets to be selected at each iteration of the 
algorithm and the number of iterations do not seem to modify the number of probesets more 
selected than the others during the run. 

This was satisfying and the method appears relatively stable. First because the random 
selection of 4000 probesets carries a risk of destabilization of the results, since these 4000 are 
not necessarily those which are most indicative of ER status. Secondly, several probesets can 
represent the same gene, and different genes can be implied in the same biological pathway. 
Thus, it is possible that subsets of probesets are more similar than they appear, and therefore 
that CWrei is underestimated. For example, the probesets 209603_at and 209602_s_at men- 
tioned above both represent the gene GATA3. Finally, these simulations indicate that there is 
not a parameter whose choice introduces more sensitivity than the others. 
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4.3 Comparison with other methods 



We compared the performance of our method with the performances of other methods w hich 
do not take into account random effects: we considered the model of iLee et ah 



and 

Support Vecto r Machine wi t h Rec ursive Feature Ehmination of the variables, with linear or non 
linear kernels [Guyon et alJ . l2002l |. We used simulated data: 200 observations of 1000 variables 
following a uniform distribution ^-5,5] are generated. We assume that 5 variables and a random 
effect U of size 4 explain a vector of binary variables y by a probit mixed model: 



Pi = ^xl^ + ziu), i = l,... 

Yi ~ B{pi), i = l,...,200 



200 



We took P-y = (—1, —1, 1, 1, 2) and we assume that 50 observations are coming from each modality 
of the random effect. Different values of U were used: Ul = (0, 0, 0, 0), [72 = (-3, -2, 2, 3), C/3 = 
(-5, -3, 3, 5), C/4 = (-10, -5, 5, 10) and Uh = (-30, -10, 10, 30). This set of 200 observations 
was splitted into training and validation sets, each of them of size 100, with 25 observations 
coming from each modality of the random effect. For our method and the method of Lee et al. 
we took c = 50, 5 probesets were selected at each iteration of the Gibbs sampler and r = 2 of 
them were changed at each iteration of the Metropolis-Hastings step (1 zero and 1 one), D was 
a diagonal matrix 3x3 with Ai = and a prior TQ{1, 1) was chosen for a\, 500 iterations 
were performed for each Metropolis-Hastings step, and a total of 3000 and 5000 iterations were 
performed for the whole algorithm. 

Concerning our method and the method of Lee et al. 2003l | , the top-ranked variables (variables 
selected more often than others, box-plots were used) were used to perform predictions on the 
validation set. The RFE-SVM method gave us directly sets of "best variables" and associated 
models, and these models were used to perform the predictions. The results obtained are in 
Table El 



Random effect 


Our method 


Lee et al. [2003] method 


RFE-SVM 


U 


3000 iterations 


5000 iterations 


3000 iterations 


5000 iterations 


linear 


non linear 


Ul 


17 


26 


19 


22 


25 


23 


U2 


19 


21 


19 


19 


20 


26 


U3 


21 


23 


24 


24 


25 


26 


UA 


19 


19 


35 


35 


29 


31 


U5 


14 


11 


44 


44 


52 


56 



Table 3: Number of misclassifications on the validation set, for different methods and different 
random effects. 



When there is no random effect or when the magnitude of the random effect is small, our 
method is comparable to the one of Lee et al., and the results of these two methods are better 
than or comparable with those obtained by RFE-SVM. But when the magnitude of the random 
effect is high, especially for [74 and U5, it appears that our method outperforms the method of 
Lee et al. and the RFE-SVM method. 
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5 Discussion 



In this article we have developed an appro ach for Bayesian gene selectio n fo r a probit mixed 
model, as an extension of previous works bv lGeorge and McCullochI |l993l | and lLee et aD [2003l |. 
An important contribution of our method is that it allows selection of variables in a mixed 
framework, taking into account the design of the data. It is particularly useful for gene se- 
lection, as it enables the use of merged datasets in order to introduce more observations and 
greater diversity. That may provide improved power, and we can avoid bias due to a partic- 
ular dataset. The increased size of a merged dataset facilitates its re-splitting into training 
and validation sets, hence we do not need to evaluate the performance of a classification rule 
by a cross-validation procedure. It is advantageous compared to other methods which do not 
take into account random effects. Indeed, as these methods can use only one dataset which 
is usually of small size, they often need to perform leave-one-out-cross-val i dation , which can 
be tini e-con suming (see Lee et al. 2003l |. Yang and Song 201Cll |. Sha et al. 2004], Zhou et al 



2004bl | and I Zhou et al.l 2004cl | for instance) . On the contrary, if several datasets are merged 



then a separated training set can be used and the performance of a classifier can be directly 
obtained on it. Using simulations to make comparisons with other methods which do not take 
into account random effects, we showed that the proposed method is comparable to others when 
the magnitude of the random effects is low, but performs better than the others for classification 
when the magnitude of the random effects is high. This method should prove widely useful in 
microarray bioinformatics, since many diverse datasets are freely available on the Internet. But 
it can also be used for data obtained from high throughput sequencing technologies, which will 
probably be used a lot in few years. Indeed, the method can be applied when we have a matrix 
with n « p and an associated vector of random effects. 



In practice, before running an analysis, one must decide how many variables will be selected 
at each iteration of the Gibbs sampler. We do not consider this to be a drawback, since in 
order to have a reliable selection, the number of probesets should be limited compared to the 
size of the training set. Besides, fixing the number of variables selected at each iteration is 
a computational advantage, as discussed in section 13.3.11 In particular, the singularity of the 
X'^X^ matrix is avoided. 

In addition, one must choose a value for the hyperparameter c which is large enough to have a 
relatively non-informative prior. Only the simulations of (3^^ \ L,U,j and of 7 | L, [/ directly de- 
pend on c. Concerning | L, U, 7, we can see in ([4]) that the density is proportional to c/(H- c) , 
which is relatively close to 1 if c is large. Concerning the density of 7 | L,U (see (jlOp ). it depends 

on c/{l + c) and on (1 -|- c) 2^ . The factor (1 -|- c) 2^ does not play a role in the simulation of 
7, because the number of variables to be selected at each iteration is fixed: this factor vanishes in 
the acceptance rate of the Metropolis-Hastings step of the algorithm. Therefore the value chosen 
should not be too infiuent, as long as it is large enough. We chose arbitrarily c = 50, follow- 
ing Smith and Kohn's (1997) re c ommendations. However diffe rent authors suggested d ifferent 



Chipman et al 



2001 



George arid Foster _ 2000 1 ari d Clyde and George _ 2000tl among 



2004bl | and lZhou et al.1 |2004d | used c = 10, and lLee et al.1 |2003l | 



ranges, see 

others. For example IZhou et al 

used c = 100. However, it is possible t o include another l evel i n our Bayesian hierarchical model 



and to put a prior distribution on c. IZellner and Slow [l980^ for instance proposed a mixture 
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of g-priors and an inverse-gamma prior on c. Recently iBottolo and RichardsonI 20101 ] consid- 
ered putting a hyperprior on c and using a Metropolis-within-Gibbs with adaptive proposal for 
updating this coefficient. In our application this coefficient was held fixed for convenience, and 
good results were obtained. Besides the sensitivity study showed us that the method is not 
overly sensitive to the value chosen for c, as expected. 

More generally, it appeared that the algorithm is fairly stable to variations in the training set, 
and is robust to prior value of any of the hyperparameters. 



Convergence could not be verified because we did not have formal diagnostic tools to prove 
it, as the parameters vectors used in the proposed algorithm were not associated to the same 
variables from one iteration to the next. Besides, the different runs could have converged to a 
local mode of the posterior distribution of 7, and not to a global one. But the results obtained in 
the stability and sensitivity analyses were satisfactory, as different runs with different starting 
points and different prior hyperparameters selected broadly the same variables, which means 
that these different chains had basically the same behavior. From our experience, it appeared 
that having a total number of iterations equal to three times the size of the set of predictors is 
sufficient, the results were not significantly different when more iterations were performed. 



The probesets selected by our method to characterize the estrogen receptor status enabled 
us to fit a model with good predictions. Moreover, three genes among the five used in the model 
were also selected using a Support Vector Machine method (twenty-four genes were selected by 
SVM), and another group of three among tho se five is known to be associated with estrogen 
receptor pathways and breast cancer : GREBl iNagaraia et al.l . bood . iTownson and O'Connelll . 
200fi[ . iRae et~aD . l200,^ . SERPINA3 [Cimino et al.l l2008l | and MYHll ]singh and Chaturvedil . 
20091 ]. Therefore, it seems that the probesets selected by our method are quite biologically rel- 
evant. 



The algorithm developed is efficient and feasible, even for very large datasets with around 
20000 variables. Therefore this approach has a clear advantage over other selection methods 
which handle less variables or which do not take into account random effects. However, Bayesian 
variable selection is an active research area, and it would be interesting to combine our method 
with recent proposals. For instance by studying the performance of the method with oth e r pr ior 
distributions for o"^, like half-Cauchy or folde d-noncentral-^ distributions, s ee iGelmanI [2006] . 
Or by putting a prior distribution on c, like in Bottolo and Richardson 2010| ] . It would also be 
of interest to consider an alternative prior distribution for (3^ to handle a non-invertible X'^X^ 
(when 7 is itself singular or when n < d) , by combining our a pproach with the concept of ridge 
regression (work in progress. 
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